2018-03-14

Outline

  • Intro
  • Genomic approaches for purity estimation
    • Variant allele freq. + copy number profiles
    • Methylation
    • RNA-seq
  • Understanding relationship between VAF, purity and CN
  • Getting your own purity estimates (when necessary)
  • How to use purity estimates in genomic analyses

Tumor purity

  • Infiltrating lymphocytes and stromal cells are ubiquitous in solid tumor samples
  • These reduce "tumor signal" in genomic analyses and can introduce bias
  • TCGA QC threshold for inclusion was >60% tumor purity (visual counting)
  • According to ASCAT estimates (COSMIC DB), only 46% of samples meet it.

Purity from VAF + CN

Well-known software:

  • ABSOLUTE (Carter et al. 2012)
  • ASCAT (Van Loo et al. 2010)

What they do:

  • Assign absolute copy numbers to each genomic region
  • Implies estimating purity (cancer cell fraction)
  • Can also estimate variant copy numbers (ABSOLUTE)

Requires:

  • log2R (from SNP arrays, WGS, WXS)
  • BAF (germline AF, from SNP arrays or WGS)
  • VAF (WXS, WGS)

Cons:

  • Computationally intensive (hours per sample)
  • Multiple solutions are possible (manual selection)

ABSOLUTE vs ASCAT (SKCM only)

ABSOLUTE vs ASCAT (1220 samples)

Methylation (Zhang et al. 2015)

  • Use beta values [0,1]

* Results shown on training samples

ASCAT vs Infinium (5689 samples)

  • Comparison includes several new samples

ASCAT vs Infinium (5689 samples)

Infinium measures "cancer DNA fraction" rather than "cancer cell fraction"

ASCAT vs Infinium (5689 samples)

  • Worst case scenarios (new cancer types)

RNA-seq

Performs poorly …

  • Expression is cancer specific (needs training)
  • Expression values not bounded + high variance
  • Expression vs purity is not linear
  • Expression is less correlated than beta values
  • Expression values can come from different pipelines

VAF vs purity and copy number

Given a pen and a piece of paper, it is relatively straightforward to get the expected VAF as a function of purity and copy number:

expected_VAF = function(p, mc = 1, CN_T = 2, CN_N = 2){
  (mc * p) / (CN_T * p + CN_N * (1 - p) )
}
  • p : purity (cancer cell fraction)
  • mc : mutated copy number
  • CN_T : copy number in tumor
  • CN_N : copy number in normal
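
As a rough illustration, the function can be evaluated for the four scenarios shown on the following slides. The purity of 0.6 and the copy-number states assigned to each scenario (one mutated copy, or two after genome doubling) are assumptions made for this sketch, not values taken from the figures.

```r
expected_VAF = function(p, mc = 1, CN_T = 2, CN_N = 2){
  (mc * p) / (CN_T * p + CN_N * (1 - p))
}

p = 0.6  # assumed purity, for illustration only

# Hypothetical copy-number states for each scenario
scenarios = c(
  diploid         = expected_VAF(p, mc = 1, CN_T = 2),
  chromosome_loss = expected_VAF(p, mc = 1, CN_T = 1),
  chromosome_gain = expected_VAF(p, mc = 1, CN_T = 3),
  genome_doubling = expected_VAF(p, mc = 2, CN_T = 4)
)
print(round(scenarios, 3))
```

In the diploid case the expression reduces to p / 2, so a heterozygous variant in a 60% pure sample is expected at a VAF of 0.3.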

Simple diploid case

With a chromosome loss

With a chromosome gain

With genome doubling (4 copies)

Purity vs VAF and copy number

Using a bit of elementary algebra, we can express expected purity as a function of VAF and copy number:

expected_purity = function(VAF, mc = 1, CN_T = 2, CN_N = 2){
  CN_N * VAF / (mc + VAF * (CN_N - CN_T))
}
  • VAF : variant allele frequency
  • mc : mutated copy number
  • CN_T : copy number in tumor
  • CN_N : copy number in normal
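
One way to check the algebra is a round trip: compute the expected VAF for known purities, then recover the purities from those VAFs. The sketch below does this for a locus assumed to sit at 3 copies in the tumor; the purity values are arbitrary. It also shows what happens if the same VAFs are interpreted with the default diploid assumption.

```r
expected_VAF = function(p, mc = 1, CN_T = 2, CN_N = 2){
  (mc * p) / (CN_T * p + CN_N * (1 - p))
}
expected_purity = function(VAF, mc = 1, CN_T = 2, CN_N = 2){
  CN_N * VAF / (mc + VAF * (CN_N - CN_T))
}

p_true = c(0.2, 0.5, 0.9)                  # arbitrary purities
vaf = expected_VAF(p_true, mc = 1, CN_T = 3)

# Correct copy number: the original purities are recovered
p_back = expected_purity(vaf, mc = 1, CN_T = 3)
print(all.equal(p_back, p_true))

# Wrong copy number (diploid default): purity is underestimated
p_diploid = expected_purity(vaf)
print(round(p_diploid, 3))
```

With the diploid default the estimate is simply 2 * VAF, which falls below the true purity whenever the variant lies in a gained region with a single mutated copy.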

Purity vs VAF and copy number

Log2R vs copy number and purity

  • p : purity
  • CN_T : copy number in tumor
  • CN_N : copy number in normal
  • CN_R : reference copy number in tumor
expected_log2R = function(p, CN_T = 2, CN_N = 2, CN_R = 2){
  log2( CN_T * p + CN_N * (1-p) ) - log2( CN_R * p + CN_N * (1-p) )
}

expected_purity_from_log2R = function(log2R, CN_T, CN_N = 2, CN_R = 2){
  r = 2^log2R  # linear-scale copy ratio
  ((r - 1) * CN_N) / (CN_T - CN_N + r * (CN_N - CN_R))
}
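
The two functions are inverses of each other, which the following sketch checks for three assumed situations: a one-copy loss, a one-copy gain, and a diploid segment in a sample whose reference (median) copy number is 3. The purity of 0.7 is arbitrary, the helper is named `purity_from_log2R` here, and the argument names follow the CN_T / CN_N / CN_R convention of the parameter list above.

```r
expected_log2R = function(p, CN_T = 2, CN_N = 2, CN_R = 2){
  log2(CN_T * p + CN_N * (1 - p)) - log2(CN_R * p + CN_N * (1 - p))
}
purity_from_log2R = function(log2R, CN_T, CN_N = 2, CN_R = 2){
  r = 2^log2R  # linear-scale copy ratio
  ((r - 1) * CN_N) / (CN_T - CN_N + r * (CN_N - CN_R))
}

p_true = 0.7  # arbitrary purity

lr_loss = expected_log2R(p_true, CN_T = 1)            # one-copy loss
lr_gain = expected_log2R(p_true, CN_T = 3)            # one-copy gain
lr_ref3 = expected_log2R(p_true, CN_T = 2, CN_R = 3)  # diploid segment, median at 3 copies

p_loss = purity_from_log2R(lr_loss, CN_T = 1)
p_gain = purity_from_log2R(lr_gain, CN_T = 3)
p_ref3 = purity_from_log2R(lr_ref3, CN_T = 2, CN_R = 3)
print(c(loss = p_loss, gain = p_gain, ref3 = p_ref3))
```

Note that the inversion is undefined when CN_T equals CN_R (log2R is 0 whatever the purity), so only segments that deviate from the reference copy number are informative.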

Simple diploid case

Median centered on 3 copies

Genome doubling

Sample's median VAF (n = 10102)

Most variants are heterozygous (1:2)

Using median VAF (1220 samples)

purity = min(median(VAF) * 2 , 1)

  • Same samples as for the ABSOLUTE vs ASCAT comparison
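
A minimal sketch of the median-VAF estimate wrapped as a function. The function name and the VAF vectors are made up for illustration; the approach assumes most variants are heterozygous in diploid regions.

```r
# Purity from the median VAF of one sample, capped at 1
purity_from_median_vaf = function(vaf){
  min(median(vaf, na.rm = TRUE) * 2, 1)
}

# Made-up VAFs for two hypothetical samples
vaf_a = c(0.18, 0.22, 0.25, 0.27, 0.31, 0.45)
vaf_b = c(0.55, 0.60, 0.70)

purity_a = purity_from_median_vaf(vaf_a)  # median 0.26, so about 0.52
purity_b = purity_from_median_vaf(vaf_b)  # 2 * 0.60 exceeds 1, so capped at 1
print(c(purity_a, purity_b))
```

The cap matters for samples dominated by homozygous or LOH variants, where twice the median VAF would otherwise exceed 1.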

Using CNA

Using CNA (1220 samples)

  • Same samples as for the ASCAT vs ABSOLUTE comparison

VAF and CNA (1220 samples)

  • purity_cna = (1 - 2^{log2R}) * 2 # If > 0.2
  • purity_cna_vaf = mean(c(purity_cna, purity_VAF))

  • Same samples as for the ASCAT vs ABSOLUTE comparison
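
The combined estimate can be sketched as below. The CNA formula is the one-copy-loss case of the log2R relationship (CN_T = 1, CN_N = CN_R = 2). Function names and input values are hypothetical, and the "> 0.2" condition is left out because its exact definition is not given here.

```r
# Purity from the log2R of a segment assumed to be a one-copy loss
purity_from_loss = function(log2R){
  (1 - 2^log2R) * 2
}

# Purity from the median VAF, capped at 1
purity_from_median_vaf = function(vaf){
  min(median(vaf, na.rm = TRUE) * 2, 1)
}

# Made-up inputs for one hypothetical sample
log2R_loss = log2(0.7)
vaf = c(0.18, 0.22, 0.25, 0.27, 0.31, 0.45)

purity_cna = purity_from_loss(log2R_loss)   # about 0.6
purity_VAF = purity_from_median_vaf(vaf)    # about 0.52

# mean() averages a single vector, so the two estimates are wrapped in c()
purity_cna_vaf = mean(c(purity_cna, purity_VAF))
print(purity_cna_vaf)
```

Passing the two estimates as separate arguments to `mean()` would not average them, because the second positional argument of `mean()` is `trim`.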

VAF and CNA (7372 samples)

  • All samples available for ASCAT

Difference with ASCAT distribution (7365 samples)

Difference < 25% for 89% of samples

Using methylation (SKCM only)

Two probes most highly correlated with absolute purity

Using methylation (SKCM only)

  • Simple linear model on best two probes
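
A minimal sketch of such a model using base R's `lm()`. The data are entirely synthetic (simulated beta values for two hypothetical probes, one gaining and one losing methylation with purity), so only the mechanics carry over, not the numbers.

```r
set.seed(1)
n = 50

# Synthetic training set: reference purities and two probe beta values in [0, 1]
purity = runif(n, 0.2, 1)
probe1 = pmin(pmax(0.1 + 0.7 * purity + rnorm(n, sd = 0.05), 0), 1)
probe2 = pmin(pmax(0.9 - 0.6 * purity + rnorm(n, sd = 0.05), 0), 1)
train = data.frame(purity, probe1, probe2)

# Simple linear model on the two probes
fit = lm(purity ~ probe1 + probe2, data = train)

# Predict purity for new (made-up) samples and clip to the valid range
new_samples = data.frame(probe1 = c(0.3, 0.6), probe2 = c(0.7, 0.45))
pred = pmin(pmax(predict(fit, newdata = new_samples), 0), 1)
print(round(pred, 2))
```

Clipping is needed because a linear model does not constrain its predictions to [0, 1]. In practice the reference purities would come from an independent method such as ABSOLUTE or ASCAT.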

TCGA purity distribution (10102 samples)